Sequence comparison is a widely used computational technique in modern molecular biology. In 
spite of the frequent use of sequence comparisons the important problem of assigning statistical 
significance to a given degree of similarity is still outstanding. Analytical approaches to filling 
this gap usually make use of an approximation that neglects certain correlations in the disorder 
underlying the sequence comparison algorithm. Here, we use the longest common subsequence 
problem, a prototype sequence comparison problem, to analytically establish that this approximation 
does make a difference to certain sequence comparison statistics. In the course of establishing this 
difference we develop a method that can systematically deal with these disorder correlations. 

PACS numbers: 87.15.Cc, 87.10.+e, 02.50.-r, 05.40.Fb 



I. INTRODUCTION 

Sequence comparison attracts interest from a wide variety of fields such as molecular biology, biophysics, mathematics, and computer science. Methods of comparison concern computer scientists, who use string comparison for everything from file searches to image processing. Biological sequence comparison provides insight into the building blocks of life by allowing the functional identification of newly found sequences through similarity to already studied ones. Thus, it has become a standard tool of modern molecular biology.

As with all pattern search algorithms, a crucial component for the successful application of sequence comparisons is the ability to discern biologically meaningful patterns from randomly occurring ones. Thus, a thorough characterization of the strength of patterns within random data is essential for establishing a criterion for discerning meaningful data.

The most commonly used sequence alignment algorithms are the closely related Needleman-Wunsch and Smith-Waterman algorithms. There have been numerous numerical and analytical studies that attempt to characterize the behavior of these algorithms on random sequence data. However, there are difficulties with both kinds of approaches to the problem of characterizing sequence alignment algorithms statistically. The numerical methods are by far too slow to be useful in an environment where tens of thousands of searches are performed on a daily basis and users expect their results on interactive time scales. The analytical methods, on the other hand, while in principle able to rapidly characterize sequence alignment statistics, are only valid in small regions of the vast parameter space of the sequence alignment algorithms.

In addition to being restricted to a small region of parameter space, current analytical methods have another drawback: they rely on an approximation to the actual alignment algorithm that ignores some subtle correlations within the sequence disorder. Here, we want to demonstrate that such correlations do matter and propose an analytical approach that can in principle deal with these correlations for certain finite size variants of sequence alignment introduced in section III.

We will concentrate on the simplest prototype of a se- 
quence alignment algorithm, namely the longest common 
subsequence (LCS) problem. More complicated models 
of sequence alignment can be adapted to the methodol- 
ogy presented here in a straightforward manner. How- 
ever, in the interest of clarity and efficiency, we pro- 
ceed with the simple LCS problem in mind. In the 
longest common subsequence problem similarity between 
two randomly chosen sequences over an alphabet of size 
c is measured by the length of the longest string that can 
be constructed from both sequences solely by deleting 
letters. The central quantity characterizing the statistics 
of the LCS problem is the expected length of this longest 
common subsequence. Its fraction of the total sequence 
length in the limit of infinitely long sequences is called 
the Chvatal-Sankoff constant $a_c$.

Although the LCS problem is one of the simplest alignment algorithms, the value of the Chvatal-Sankoff constant has been remarkably elusive. So far, analytical attempts at $a_c$ have led to exact solutions for very short lengths and to proofs of upper and lower bounds. Based on numerical results, there existed a long-standing conjecture for the value of the Chvatal-Sankoff constant. Recently [22, 23], this conjecture has been proven to hold true for the approximation to the LCS problem that ignores precisely the disorder correlations mentioned above. Very careful and extensive numerical treatments have revealed that the true Chvatal-Sankoff constant (including all disorder correlations) deviates slightly from its value in the uncorrelated approximation. This paper seeks to introduce a systematic way of understanding the LCS problem with all disorder correlations included and to establish in an analytically tractable environment that uncorrelated and correlated disorder indeed lead to different results.



This paper is organized as follows. Section II summarizes the LCS problem: it gives a general description of the problem, outlines a commonly used paradigm for computing the LCS, and defines several conventions that are used throughout the paper. In section III we introduce the finite width model (FWM) method. To spare the reader possibly distracting mathematical details, we discuss only the overall ideas in the main text and reserve Appendix A for the more detailed discussion of the mathematical methods employed in the FWM. In section IV we give the results of the FWM method for the correlated and uncorrelated LCS problem and discuss the differences between these two problems that become obvious in the FWM treatment. Section V summarizes our findings.



II. REVIEW OF THE LCS PROBLEM 

The LCS of two sequences is the longest sequence that can be formed solely by deletions in both sequences. It is best described by example: the LCS of 'DARLING' and 'AIRLINE' is 'ARLIN', with a subsequence length of 5. Given two sequences of length $M$ and $N$, $x_1 x_2 \ldots x_M$ and $y_1 y_2 \ldots y_N$, over an alphabet of size $c$, their LCS can be computed in $O(MN)$ time. This computation may be conveniently visualized with a rectangular grid such as the one shown in Fig. 1. In this example, for the two sequences $x_1 x_2 \ldots x_6$ = '001001' and $y_1 y_2 \ldots y_6$ = '010110', the LCS, '0100', has a length of 4. Within the grid used to find the LCS, all horizontal and vertical bonds are assigned a value of 0. Each diagonal bond is assigned a value depending on the associated letters in its row and column: matching letters earn their diagonal bond a value of 1, while non-matching letters result in a value of 0. Then, each directed path across these bonds from the first lattice point in the upper left to the last in the lower right, as drawn in Fig. 1, corresponds to a common subsequence of the two sequences. The only restriction is that the path may never proceed against the order of the sequences: it may only move rightward, downward, or right-downward in Fig. 1. The length of the common subsequence corresponding to a path is the sum of the bonds that comprise that path. Solving this visual game for the length of the LCS requires that we find the path of greatest value. This value is the length of the LCS.

Recursively, we may define this problem by introducing the quantity $\ell(i,j)$ as the length of the LCS of the substrings $x_1 x_2 \ldots x_i$ and $y_1 y_2 \ldots y_j$. Defining the LCS of two substrings in this way allows us to find the LCS leading to each of the lattice points in Fig. 1. This in turn breaks our path search down into more manageable steps:
$$\ell(i,j) = \max\{\ell(i-1,j),\ \ell(i,j-1),\ \ell(i-1,j-1) + \eta(i,j)\} \tag{1}$$

where

$$\eta(i,j) = \begin{cases} 1 & x_i = y_j \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

and

$$\ell(0,j) = \ell(i,0) = 0. \tag{3}$$

FIG. 1: Grid representation of the longest common subsequence (LCS) problem. The dashed line highlights a solution to the LCS of the two binary sequences on the edges of the square. Notice that there exist multiple solutions to this problem.

Of course, once we evaluate the final $\ell(M,N)$, we have solved for the length of the LCS. Ultimately, we wish to evaluate the central quantity that characterizes the LCS problem, the Chvatal-Sankoff constant $a_c$. It characterizes the ensemble of LCS's of pairs of randomly chosen sequences with $M = N$ independently identically distributed (iid) letters. If we denote averages over the ensemble by $\langle \cdots \rangle_N$, the Chvatal-Sankoff constant may be defined as

$$a_c = \lim_{N\to\infty} \frac{\langle \mathrm{LCS} \rangle_N}{N} \tag{4}$$



This can be interpreted as the average growth rate of the 
LCS of two random sequences. 
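
A rough feeling for this growth rate can be obtained by random sampling. The following sketch is not the procedure used in the paper; the function names, sample sizes, and seed are arbitrary choices made here. It averages the LCS length of random pairs over an alphabet of size $c$ and divides by $N$. At finite $N$ the estimate lies somewhat below the limiting value.

```python
import random


def lcs_length(x, y):
    """LCS length using two rolling rows of the dynamic programming table."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + (xi == yj)))
        prev = cur
    return prev[-1]


def estimate_growth_rate(n, samples, c=2, seed=0):
    """Average of LCS/n over `samples` random pairs of iid sequences of length n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        x = [rng.randrange(c) for _ in range(n)]
        y = [rng.randrange(c) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (samples * n)


print(estimate_growth_rate(n=200, samples=20))
```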

As is evident from Fig. 1 and the recursive equations (1)-(3), the length of the LCS depends on the sequences only via the values of the $\eta$'s. If we define the probabilities of 0 or 1 occurring in our random sequences as $p$ or $1-p$, respectively, where $0 < p < 1$, each individual $\eta$ carries a probability $2(1-p)p$ of being 0 and a probability $p^2 + (1-p)^2$ of being 1. However, the different $\eta(i,j)$ are not chosen independently according to those probabilities, but are subject to subtle correlations.



It is very tempting to neglect these correlations in favor of choosing $N^2$ iid variables $\hat{\eta}(i,j)$ according to

$$\hat{\eta}(i,j) = \begin{cases} 1 & \text{with probability } 1-q \\ 0 & \text{with probability } q \end{cases} \tag{5}$$

where $q = 2(1-p)p$ is the probability of a bond value being 0. We will call this the uncorrelated LCS problem and identify all quantities calculated for this problem by an additional hat; specifically, we will call $\hat{a}_c$ the analog of the Chvatal-Sankoff constant in the uncorrelated LCS problem. This approximation to the real LCS problem has been used in various theoretical approaches to sequence comparison statistics. For the LCS problem itself, only very careful numerical studies could show that the Chvatal-Sankoff constants for the correlated and uncorrelated problems are actually different. However, there are very real differences. They arise because the correlated and uncorrelated cases allow different sets of possible bond values $\eta$. In the uncorrelated case all combinations of bond values are allowed to exist. The grid in Fig. 1 makes it obvious that there are $NM$ bonds, each with 2 possibilities. Therefore, there must exist $2^{NM}$ unique configurations of bond values. Meager by comparison are the $2^{N+M-1}$ configurations allowed in the correlated case by the two sequences of length $M$ and $N$, with each letter having the capacity to take on one of two values. Notice that the missing factor of 2 in the correlated case arises because one can always exchange all 0's and 1's in both sequences and obtain the same pattern of matches, which cuts the possible number of bond value configurations in half. A more concrete realization of the limited possibilities in the correlated case comes simply from noticing that there are only two bond value configurations that any row or column can take in Fig. 1. Additionally, a column or row belonging to a letter 0 must be the mirror opposite of a column or row belonging to a letter 1. And so we can see that not only does the uncorrelated case account for sets of bond values that cannot exist, it also cannot mimic the specific relationship between different rows and columns of bond values. We will reveal these differences in a simple approximation to the LCS problem that allows for closed analytical solutions in the correlated and uncorrelated case.
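
The counting argument can be checked by brute force for small lattices. The sketch below (helper names are illustrative, not taken from the paper) enumerates all pairs of binary sequences, builds the matrix of diagonal bond values for each pair, and counts how many distinct matrices occur. The count is to be compared with the $2^{NM}$ matrices available when every bond is chosen independently.

```python
from itertools import product


def bond_matrix(x, y):
    """Matrix of diagonal bond values: 1 where the letters match, 0 otherwise."""
    return tuple(tuple(int(a == b) for b in y) for a in x)


def count_correlated_configs(M, N):
    """Number of distinct bond matrices generated by binary sequences of length M and N."""
    configs = {
        bond_matrix(x, y)
        for x in product((0, 1), repeat=M)
        for y in product((0, 1), repeat=N)
    }
    return len(configs)


M, N = 3, 3
print(count_correlated_configs(M, N), "correlated vs", 2 ** (M * N), "uncorrelated")
```

For $M = N = 3$ only 32 of the 512 bond matrices can be produced by actual sequences.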



III. FINITE WIDTH LCS 

The Finite Width Model (FWM) we will use in the following overlays the grid presented in Fig. 1 with the restriction in width presented in Fig. 2. We will measure the width $W$ of such a grid by the number of bands that make up the lattice, i.e., $W = 2$ in Fig. 2. Although the grid used to analyze the LCS must be truncated to a finite width $W$, our finite strip extends to an infinite length. Thus, we can still define a width dependent Chvatal-Sankoff constant

$$a_c(W) \equiv \lim_{N\to\infty} \frac{\langle \mathrm{LCS}_W \rangle_N}{N} \tag{6}$$

where $\mathrm{LCS}_W$ denotes the length of the LCS restricted to the strip of width $W$.

FIG. 2: This picture shows our 45° counter-clockwise rotation to achieve the orientation from which we will proceed. The blow-up defines the lattice site values ($k$-values), the match values ($\eta$-values), and the lattice site difference values ($h$-values). It also defines our time and width axes. The dashed lines connect the sites between which our $h$-values are measured.



In addition to the width $W$, the growth rate also depends on the sequence composition. In our case of alphabet size $c = 2$ this is characterized by the probability $p$ to find a '0' on each site within the sequences. While the method we will present below in principle enables us to calculate any $a_c(W,p)$, we will, in the following, concentrate on the simple example $a_c(W = 2, p)$ shown in Fig. 2. Notice that

$$\lim_{W\to\infty} a_c(W, \tfrac{1}{2}) = a_c \tag{7}$$

from below, and thus $a_c(W, \tfrac{1}{2})$ produces a series of lower bounds to the Chvatal-Sankoff constant.
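
The lower-bound property can be seen directly in a band-restricted version of the standard dynamic programming table. The sketch below makes one assumption that the text does not spell out: a strip of width $W$ is taken to consist of the lattice points with $-\lfloor W/2 \rfloor \le i - j \le W - \lfloor W/2 \rfloor$, i.e. $W + 1$ adjacent diagonals. The function name is illustrative.

```python
def banded_lcs_length(x, y, W):
    """LCS length when paths are confined to a strip of W bands around the diagonal.

    Assumes len(x) == len(y) and the band convention stated in the lead-in.
    """
    n = len(x)
    lo, hi = W // 2, W - W // 2
    NEG = float("-inf")  # marks lattice points outside the strip
    ell = [[NEG] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if not -lo <= i - j <= hi:
                continue
            if i == 0 or j == 0:
                ell[i][j] = 0
            else:
                eta = 1 if x[i - 1] == y[j - 1] else 0
                ell[i][j] = max(ell[i - 1][j], ell[i][j - 1], ell[i - 1][j - 1] + eta)
    return ell[n][n]


values = [banded_lcs_length("DARLING", "AIRLINE", W) for W in range(0, 15)]
print(values)
```

Widening the strip only adds allowed paths, so the values never decrease with $W$ and end at the unrestricted LCS length once the strip covers the whole grid.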

On the finite width lattice shown in Fig. 2 it is convenient to redefine our quantities. Aside from the narrower scope under which we investigate the LCS problem, all other properties of the grid problem remain the same. Instead of referring to our lattice points by the coordinates $i$ and $j$, we now utilize a time axis, $t$, which points along the allowed, sequence forward, direction as well as a coordinate axis, $x$, which lies perpendicular to the $t$-axis. In place of the old $\ell$-values, in this new coordinate system we introduce $k$-values. Keeping track of these $k$-values can be simplified to a new set of recursive relationships at each time step. In the notation defined by Fig. 2 for width $W = 2$,



$$k(2,t+1) = \max\{k(2,t) + \eta(2,t),\ k(1,t)\} \tag{8}$$

$$k(1,t+1) = \max\{k(2,t+1),\ k(1,t) + \eta(1,t),\ k(0,t+1)\} \tag{9}$$

$$k(0,t+1) = \max\{k(1,t),\ k(0,t) + \eta(0,t)\} \tag{10}$$



where the $\eta$'s take on values of either 1 or 0 in the same way that we assigned values to our diagonal bonds in Fig. 1. Our set of recursive relationships gives us the longest path value up to each lattice site. The length of the FWM LCS then becomes $k(1,N)$.

Related to our $k$-values by equations (11) and (12), we define the $h$-values in order to describe the relative values of our lattice sites within any given time frame. Utilizing the diagrammed definitions of Fig. 2,

$$h(1,t) = k(1,t) - k(2,t) \tag{11}$$

$$h(0,t) = k(1,t) - k(0,t). \tag{12}$$
(12) 



The recursion relations (8), (9), and (10) can be expressed entirely via this new quantity as

$$h(1,t+1) = \max\{0,\ h(1,t) + \eta(1,t) - r(1,t),\ h(1,t) + s(0,t) - r(1,t)\} \tag{13}$$

$$h(0,t+1) = \max\{0,\ h(0,t) + \eta(1,t) - r(0,t),\ h(0,t) + s(1,t) - r(0,t)\} \tag{14}$$

where

$$r(1,t) = \max\{\eta(2,t),\ h(1,t)\} \tag{15}$$

$$r(0,t) = \max\{\eta(0,t),\ h(0,t)\} \tag{16}$$

$$s(1,t) = \max\{0,\ \eta(2,t) - h(1,t)\} \tag{17}$$

$$s(0,t) = \max\{0,\ \eta(0,t) - h(0,t)\} \tag{18}$$



Several properties conveniently arise from these definitions. First, notice that the $h$-values are independent of the absolute $k$-values. Furthermore, it may be shown by inspection that the $h$-values may only take on the values $\{0, 1\}$. Inspecting Fig. 2, we note that each adjacent set of $k$-values shares a nodal $k$. This node attaches itself to the adjacent sites via a bond of value 0 or 1. Since the nodal $k$ holds only a single value and the single bonds leading to the adjacent sites can only change this value by +1, the only $h$-values allowed then become 0 or 1. Having detached the absolute $k$-values from the FWM-LCS problem entirely, we may further detach the entire FWM from the grid in Fig. 1. Originally we noted that our FWM-LCS on this grid becomes $k(1,N)$. However, in order to calculate $a_c(W,p)$, the length of the LCS problem has to be increased infinitely along the time axis. Time now becomes an unbounded axis in the FWM. Since the difference $|k(x,t) - k(y,t)|$ is bounded by $W$ for all $x$ and $y$, the length of the LCS may be measured at any $x$.
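
Both properties can be checked numerically. The sketch below assumes the width 2 recursions in the form $k(2,t+1) = \max\{k(2,t) + \eta(2,t), k(1,t)\}$, $k(0,t+1) = \max\{k(1,t), k(0,t) + \eta(0,t)\}$, $k(1,t+1) = \max\{k(2,t+1), k(1,t) + \eta(1,t), k(0,t+1)\}$, together with the $r$ and $s$ quantities defined above. It iterates the $k$-values and the $h$-values side by side on random bonds and confirms that the $h$-recursion reproduces the differences of the $k$-values and never leaves $\{0, 1\}$. All function names are illustrative.

```python
import random


def step_k(k, eta):
    """Advance (k2, k1, k0) by one time step with bonds eta = (eta2, eta1, eta0)."""
    k2, k1, k0 = k
    e2, e1, e0 = eta
    n2 = max(k2 + e2, k1)      # outer site x = 2
    n0 = max(k1, k0 + e0)      # outer site x = 0
    n1 = max(n2, k1 + e1, n0)  # central site x = 1
    return n2, n1, n0


def step_h(h, eta):
    """Advance (h1, h0) = (k1 - k2, k1 - k0) without reference to the k-values."""
    h1, h0 = h
    e2, e1, e0 = eta
    r1, r0 = max(e2, h1), max(e0, h0)
    s1, s0 = max(0, e2 - h1), max(0, e0 - h0)
    return (max(0, h1 + e1 - r1, h1 + s0 - r1),
            max(0, h0 + e1 - r0, h0 + s1 - r0))


def recursions_agree(steps=10_000, seed=0):
    """Run both recursions on the same random bonds and compare them at every step."""
    rng = random.Random(seed)
    k, h = (0, 0, 0), (0, 0)
    for _ in range(steps):
        eta = tuple(rng.randint(0, 1) for _ in range(3))
        k, h = step_k(k, eta), step_h(h, eta)
        if h != (k[1] - k[0], k[1] - k[2]) or not set(h) <= {0, 1}:
            return False
    return True


print(recursions_agree())
```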




FIG. 3: The diagram maps the influence of letters from the two sequences $\ldots, x_{t-1}, x_t, \ldots$ and $\ldots, y_{t-1}, y_t, \ldots$ on the new orientation. The letter information required for the evolution from time $t-1$ to time $t$ is shown here. The dashed lines represent the chosen configuration for the $h$'s.



The growth rate may then be expressed as the average increase of the $k$-values along any coordinate, i.e.,

$$a_c(W,p) = \langle k(x,t) - k(x,t-1) \rangle \tag{19}$$

Notice that $k(x,t) - k(x,t-1) \in \{0,1\}$ and that as a result our newly defined $a_c(W,p)$ satisfies the condition $0 \le a_c(W,p) \le 1$.

The formulation given by equations (13) and (14) makes it clear that $h(1,t)$ and $h(0,t)$ can be calculated if $h(1,t-1)$, $h(0,t-1)$, $\eta(2,t-1)$, $\eta(1,t-1)$, and $\eta(0,t-1)$ are known. This allows us to write the time evolution as a Markov model. In order to do so, the information required for the time evolution at each time step must be included in the states. For our uncorrelated states, where the $\eta$'s occur randomly, only $h(1,t-1)$ and $h(0,t-1)$ are required to determine the probable time evolution. Therefore, the uncorrelated states simply read $(h(1,t-1), h(0,t-1))$. The probabilities for the time evolution into the state $(h(1,t), h(0,t))$ may then be calculated based on the $\eta$-value probabilities given in equation (5). However, in the correlated case, the $\eta$'s are not randomly chosen. Instead, the letters in each sequence are chosen according to the probability $p$ of a 0 occurring at a single site within the sequences. Some letters affect $\eta$'s across multiple time steps, as shown in Fig. 3. In order to calculate the time evolution to time $t$, $\eta(2,t-1)$, $\eta(1,t-1)$, and $\eta(0,t-1)$ must be known. These $\eta$'s depend on the subsequences $x_{t-1}, x_t$ and $y_{t-1}, y_t$. Once these four letters are known, the state at time $t$ may be determined. Since these letters cast an influence across multiple time steps, their information must be retained in order to accurately forecast the upcoming possibilities for the $\eta$'s and our states. Redefining our states as $(h(1,t), h(0,t), x_t, y_t)$ preserves the necessary information. The remaining information for calculating the $\eta$'s needed for the time evolution, namely $x_{t+1}$ and $y_{t+1}$, arises according to the letter probabilities. These probabilities contribute to the probable time evolution into the next state $(h(1,t+1), h(0,t+1), x_{t+1}, y_{t+1})$ at time $t+1$. It can be shown that correlated states must always contain $W$ $h$-values and $W$ letter values, and that uncorrelated states must always contain $W$ $h$-values.

FIG. 4: The dashed lines represent the various configurations by which the $h$-values can be defined. Note that each of the various sets of $h$-values implies a different definition of the state, and thus requires a different set of letters. The arrows represent the letters which are required in each different configuration.

Though the number of elements in a state depends only on the width, there exist alternative means of writing our states. We are free to choose whatever configuration of continuous lines to define our $h$-values across. Fig. 4 shows the other possible configurations in the width 2 FWM. Naturally, the letter effects differ for each shape, and the proper letters for each configuration are also illustrated. The various states we may form all contain the same number of $h$ and letter values and obey the same principles. Only the specified set of $h$ and letter values differs.

Whatever state we choose to define, the Markov process describing FWM-LCS is characterized by a transfer matrix $T$. This matrix describes the transitions from a state at one time to a state in its immediate future. It is a representation of the dynamics given by Eqs. (13)-(18). We leave the mechanics of obtaining this transfer matrix to Appendix A and focus here on the results.

Once we have found the transfer matrix we may solve for the vector $\vec{s}$ describing the steady state by solving the linear system of eigenvalue equations

$$T \cdot \vec{s} = \vec{s} \tag{20}$$



subject to the normalization condition $\vec{1} \cdot \vec{s} = 1$ where $\vec{1} = (1,1,1,1,\ldots)$. Note that the size of these vectors depends on the number of states needed to describe the problem. More specifically, 16 elements are needed for the correlated width 2 FWM while the uncorrelated width 2 case only requires 4 elements. This steady state vector must contain the probabilities to observe every single state in the random ensemble. Note that the directness of this technique allows for its ready adaptation to more complex sequence comparison algorithms along the lines of Ref. [15]. However, this generally requires a significantly larger number of states, thus incurring a greater computational cost.

In order to describe the growth rate, we utilize a growth matrix $G$ to mark the transitions which result in growth along some chosen coordinate. The process by which we construct the growth matrix bears great similarity to the process by which we construct the transfer matrix. In fact, the growth matrix only omits those elements of the transfer matrix that do not contribute to the growth along a chosen coordinate. Further detail regarding the construction of the growth matrix has been left for Appendix A.

The growth matrix allows us to define the growth vector $\vec{g}$,

$$\vec{g} = \vec{1} \cdot G \tag{21}$$



This growth vector describes the probable growth from each of the states. Coupled with the steady state, which provides us with the likelihood of each state, this allows us to solve for $a_c(W,p)$ directly as

$$a_c(W,p) = \vec{g} \cdot \vec{s} \tag{22}$$

since the probability of growth from $t \to t+1$, as described by the growth matrix, is independent of the probability to be in a certain state at time $t$.

Before we discuss the results of this approach, we would like to point out that this technique is not limited to the calculation of the growth rate $a_c(W,p)$. Since the dynamics of the scores is a Markov process, any quantity can be calculated once the transfer matrix $T$ and the steady state vector $\vec{s}$ are known. For example, any equal-time correlation function of interest can be obtained directly from the steady-state vector $\vec{s}$ simply by summing over the degrees of freedom that are not to be included in the correlation function, while a time-correlation function like $\langle i(t) | j(t') \rangle$ (the probability to be in state $i$ at time $t$ given that the system was in state $j$ at time $t'$) is simply given by $\langle i(t) | j(t') \rangle = \vec{i} \cdot T^{t-t'} \vec{j}$, where $\vec{i}$ and $\vec{j}$ are vectors all entries of which are zero except for a one in the row for state $i$ or $j$, respectively.

Solving the correlated width $W = 2$ FWM-LCS problem utilizing the process given by equations (20)-(22), we arrive at the equation

$$a_2(W=2,p) = \frac{3 - 5p + 5p^2}{3 - p - 3p^2 + 8p^3 - 4p^4} \tag{23}$$



where $p$ represents the probability of the first letter occurring. The same methodology may be applied to the uncorrelated case, where we describe the transition probabilities using the bond probability $q$ defined by equation (5):

$$\hat{a}_2(2,q) = \frac{5 - 7q + 2q^2}{5 - 5q + q^2} \tag{24}$$
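
The uncorrelated width 2 result can be reproduced with a few lines of exact rational arithmetic. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the four states are $(h(1,t), h(0,t))$, the three bonds are drawn independently with probability $q$ of being 0, the transition rule `step_h` is the $h$-recursion with the $r$ and $s$ quantities of the main text, and growth is measured along $x = 2$, where the increment is $r(1,t)$. The helper names `step_h`, `solve`, and `uncorrelated_growth_rate` are invented here.

```python
from fractions import Fraction
from itertools import product


def step_h(h, eta):
    """One time step of the h-recursion for width 2; eta = (eta2, eta1, eta0)."""
    h1, h0 = h
    e2, e1, e0 = eta
    r1, r0 = max(e2, h1), max(e0, h0)
    s1, s0 = max(0, e2 - h1), max(0, e0 - h0)
    return (max(0, h1 + e1 - r1, h1 + s0 - r1),
            max(0, h0 + e1 - r0, h0 + s1 - r0))


def solve(A, b):
    """Gauss-Jordan elimination over exact fractions for a nonsingular system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def uncorrelated_growth_rate(q):
    q = Fraction(q)
    states = list(product((0, 1), repeat=2))       # (h1, h0)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = [[Fraction(0)] * n for _ in range(n)]      # T[new][old]
    g = [Fraction(0)] * n                          # growth probability per old state
    for old in states:
        for eta in product((0, 1), repeat=3):
            prob = Fraction(1)
            for e in eta:
                prob *= (1 - q) if e else q
            T[idx[step_h(old, eta)]][idx[old]] += prob
            if max(eta[0], old[0]) == 1:           # k(2) grows by r(1,t)
                g[idx[old]] += prob
    # Steady state: (T - 1) s = 0, with the last row replaced by normalization.
    A = [[T[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    A[-1] = [Fraction(1)] * n
    s = solve(A, [Fraction(0)] * (n - 1) + [Fraction(1)])
    return sum(gi * si for gi, si in zip(g, s))


print(uncorrelated_growth_rate(Fraction(1, 2)))  # 8/11
```

At $q = 1/2$ this gives $8/11$, in agreement with equation (24).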



Notice that the specifics of the state that we choose do not impact the result in any way, nor does the choice we make with respect to measuring the growth. Any combination of choices results in equations (23) and (24) for the correlated and uncorrelated cases, respectively. We explicitly verified this independence of the choice of configurations and definitions of the growth. These independent results serve as a powerful check of the correctness of the algebraic manipulations. Substituting $q = 2(1-p)p$, the probability of getting two different letters, or a bond value of 0, into equation (24) gives an equation expressed in the same quantities as the correlated Chvatal-Sankoff constant given by equation (23), namely

$$\hat{a}_2(2,p) = \frac{5 - 14p + 22p^2 - 16p^3 + 8p^4}{5 - 10p + 14p^2 - 8p^3 + 4p^4} \tag{25}$$



IV. RESULTS



Now, we will apply our method to various small width cases and discuss the implications of the results for the longest common subsequence problem. First, we check our computations and plot the results, Eqs. (23) and (25), alongside numerical data obtained by random sampling in Fig. 5. The numerical data, obtained by choosing 10,000 pairs of random sequences of length 10,000, calculating their width 2 LCS, and averaging, show no discernible deviation from the analytical results over the whole range of the parameter $p$. Already in this plot for $W = 2$, the differences between the correlated and uncorrelated cases are apparent. Coinciding only for $p = 0$ and $p = 1$, where growth is certain in every step, the two cases differ at all other points.
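
A scaled-down version of such a sampling check might look as follows. This is a sketch, not the authors' code: it assumes that the width 2 strip consists of the lattice points with $|i - j| \le 1$, uses far fewer and shorter sequences than quoted above, and the function names and seed are arbitrary.

```python
import random


def width2_lcs(x, y):
    """LCS length restricted to lattice points with |i - j| <= 1 (equal lengths assumed)."""
    center = 0   # l(i-1, i-1), value on the main diagonal
    a = b = 0    # l(i-1, i-2) and l(i-2, i-1), the two outer diagonals
    for i in range(1, len(x) + 1):
        if i == 1:
            a_new = b_new = 0  # boundary points l(1, 0) and l(0, 1)
        else:
            a_new = max(center, a + (x[i - 1] == y[i - 2]))
            b_new = max(center, b + (x[i - 2] == y[i - 1]))
        center = max(a_new, b_new, center + (x[i - 1] == y[i - 1]))
        a, b = a_new, b_new
    return center


def sample_width2_rate(p, n=1000, pairs=100, seed=1):
    """Average width 2 LCS per letter; each letter is 0 with probability p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(pairs):
        x = [0 if rng.random() < p else 1 for _ in range(n)]
        y = [0 if rng.random() < p else 1 for _ in range(n)]
        total += width2_lcs(x, y)
    return total / (pairs * n)


print(sample_width2_rate(0.5))
```

At $p = 1/2$ the estimate should scatter around the correlated width 2 value of 0.7 listed in Table I, with a statistical error that shrinks as more and longer sequences are used.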

Then, we look at the width dependence of the growth rates at the symmetric point $p = 1/2$. They are summarized in Table I. The results again verify the difference between the correlated and uncorrelated cases, with the growth rate in the uncorrelated case being systematically higher than in the correlated case. They also highlight two rather interesting exceptions. The first occurs for the case $W = 0$, in which correlations play no role and indeed have no meaning. Assigning a random bond value (uncorrelated) or two random letters (correlated) leads to the same effect. Thus, as expected, the correlated and uncorrelated cases plot identically for $W = 0$. The second exception, occurring for $W = 1$, narrows the scope of equality to three values of $p$, namely 0, 1/2, and 1. In these cases, the equality arises because exactly the same bond values are produced in each case. In all other respects, the correlated and uncorrelated versions of $W = 1$ differ.

FIG. 5: Analytical and numerical data provided by FWM. This plot shows further evidence verifying the correctness of FWM. Numerical modeling obtained by passing many random sequences through an FWM evaluation produces the data points represented. The analytical FWM model matches the numerical data with high precision for both the correlated and uncorrelated cases at width $W = 2$. The error for the numerical data presented is smaller than the symbol size.

Viewing the solutions of Table I also shows that the growth rate $a_c$ increases with width $W$. This agrees perfectly with the expectations of the FWM. As the width increases, so do the possibilities for growth. In fact, in the limit $W \to \infty$ we recover the Chvatal-Sankoff constant, the infinite width growth rate. Along the way, these finite width values of $a_c$ provide lower bounds to the Chvatal-Sankoff constant. In this way, the FWM conveniently provides a method for systematically obtaining lower bounds to the Chvatal-Sankoff constant. The values given for $a_c$ in this table may be read as a series of ever increasing, analytically solved lower bounds. In this systematic character, the FWM displays one of its advantages over conventional methods. However, the power and exactness of these solutions exact a computational cost that grows exponentially with the width $W$.




FIG. 6: Finite width growth rates as a function of the letter probability $p$ for different widths.

width W   correlated case                                                                                                 uncorrelated case
0         1/2 (0.5)                                                                                                       1/2 (0.5)
1         2/3 ($0.\overline{6}$)                                                                                          2/3 ($0.\overline{6}$)
2         7/10 (0.7)                                                                                                      8/11 ($0.\overline{72}$)
3         (0.723307587)                                                                                                   3/4 (0.75)
4         3900482569/5288762638 (0.737503808)                                                                             (0.771573604)
5         1016932681760084189805278879341973703014985562/1359136362951380586870384955918322158719785917 (0.748219759)   (0.781838317)
$\infty$  (0.8126)                                                                                                        (0.828427125)

TABLE I: Finite-width Chvatal-Sankoff constant $a_2(W, \tfrac{1}{2})$ for correlated and $\hat{a}_2(W, \tfrac{1}{2})$ for uncorrelated disorder. As a reference the value of the Chvatal-Sankoff constant for infinite width is given. In the correlated case it is only known numerically [25]; in the uncorrelated case it is given by $2/(\sqrt{2} + 1)$ [22, 23].



Next, we consider the dependence of the growth rates on the letter probability $p$. Fig. 6 shows the full analytical solutions for various widths plotted as a function of $p$. These graphs verify the trend noted in the discussion of Table I. However, they allow another interesting observation: while the difference in values between the correlated and uncorrelated cases may be immediately perceived, the shape of each of the curves appears not to depend on $W$. With increasing $W$ the curve simply appears to come closer and closer to one. In order to verify this, we rescale the difference $1 - a_c(W,p)$ of the growth rate from one by its value $1 - a_c(W,1/2)$ at $p = 1/2$. As shown in Fig. 7, these rescaled curves are indeed indistinguishable for $W = 2, 3, 4$, and 5. They clearly fall into two distinct classes, namely a curve for the correlated case and a curve for the uncorrelated case. For the uncorrelated case, where the result $\hat{a}_c(W = \infty, p) = 2/[(p^2 + (1-p)^2)^{-1/2} + 1]$ for infinite width is known [22, 23], Fig. 7 also shows excellent agreement between the finite $W$ and the infinite $W$ results. Thus, at least in the uncorrelated case there are no noticeable finite size effects in the scaling function even for widths as small as $W = 2$. Assuming that the absence of finite size effects even for small widths also holds true for the correlated case, for which we cannot independently verify this assumption, the results shown in Fig. 7 support two important conclusions: (i) the correlated and uncorrelated systems truly and systematically differ for all widths and thus also in the limit $W \to \infty$, and (ii) these curve shapes can be understood as universal properties of the correlated and uncorrelated FWM-LCS system independent of the width $W$. This implies that, given the value of $a_c$ for any $p \neq 0$ or 1, one may plot $a_c$ for all values of $p$. In other words, a single data point suffices to define a finite width system, whether it be correlated or uncorrelated.
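
For the uncorrelated case the rescaling can be tried out directly with the two closed-form expressions quoted in the text, the width 2 result with $q = 2(1-p)p$ and the infinite width result. The sketch below (function names are illustrative) evaluates both rescaled curves on a grid of $p$ values and reports their largest deviation.

```python
def a_uncorr_w2(p):
    """Uncorrelated width 2 growth rate, with q = 2(1 - p)p."""
    q = 2 * p * (1 - p)
    return (5 - 7 * q + 2 * q * q) / (5 - 5 * q + q * q)


def a_uncorr_inf(p):
    """Uncorrelated infinite width growth rate."""
    m = p * p + (1 - p) ** 2  # probability of a bond value of 1
    return 2 / (m ** -0.5 + 1)


def rescaled(a, p):
    """Distance of the growth rate from one, in units of its value at p = 1/2."""
    return (1 - a(p)) / (1 - a(0.5))


grid = [i / 100 for i in range(1, 100)]
max_dev = max(abs(rescaled(a_uncorr_w2, p) - rescaled(a_uncorr_inf, p)) for p in grid)
print(max_dev)
```

The two rescaled curves are not identical, but on this grid they differ by well under one percent of the full scale, which is below what can be resolved in a plot such as Fig. 7.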

The differences between the correlated and uncorrelated case, highlighted by Fig. 7, result from the subtle restrictions that correlations place on bond values. As an example, for the width 2 FWM, three bond values contribute to a single transition. Thus $2^3 = 8$ unique sets of bond values exist. Uncorrelated bond values allow for any of these 8 possibilities at any given time. However, because the letters affect correlated bonds in multiple time steps, each correlated state has only 4 allowed transitions. In fact, in any width the FWM provides a maximum of 4 allowed transitions for all correlated states. The reasons for this are elucidated in Appendix A. In addition to the number of possibilities lacking in the correlated case, the allowed transitions create subtle relationships, leading to patterns of growth that differ significantly from the uncorrelated case. These differences account for the systematic separation seen in Fig. 7.

FIG. 7: Rescaled growth rate for $W = 2, 3, 4, 5$ in the correlated and uncorrelated case as well as the $W = \infty$ result for the uncorrelated case. All results for the correlated case and all results of the uncorrelated case are virtually indistinguishable from each other while the correlated growth rate clearly follows a pattern that is distinctly different from the uncorrelated growth rate.
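
The restriction to four transitions can be made explicit for width 2. The sketch below assumes a particular assignment of letters to bonds, inferred from the description of Fig. 3 and not stated explicitly in the text: $\eta(2,t)$ compares $x_{t+1}$ with $y_t$, $\eta(1,t)$ compares $x_{t+1}$ with $y_{t+1}$, and $\eta(0,t)$ compares $x_t$ with $y_{t+1}$. Given the current letters, the two new letters then fix all three bonds at once.

```python
from itertools import product


def bond_triples(x, y):
    """Bond triples (eta2, eta1, eta0) reachable from the current letters (x, y)."""
    return {
        (int(xn == y), int(xn == yn), int(x == yn))
        for xn, yn in product((0, 1), repeat=2)  # new letters x_{t+1}, y_{t+1}
    }


for x, y in product((0, 1), repeat=2):
    print((x, y), sorted(bond_triples(x, y)))
```

Each choice of current letters admits exactly 4 of the 8 conceivable bond triples, and which 4 depends only on whether the current letters match.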



V. CONCLUSION 

We conclude that within the FWM method differences between the correlated and uncorrelated LCS problem can be established analytically. The dependence of the finite width growth rate on the letter probability $p$ follows a scaling law already for the relatively small widths which are analytically accessible. These scaling laws are distinctively different for the correlated and uncorrelated case within FWM, thereby providing an analytical argument that the differences between the correlated and uncorrelated case explicitly revealed for small finite widths here may persist in the limit of infinite widths. This is the first piece of analytical evidence that hints at the distinctness of the Chvatal-Sankoff constants in the correlated and uncorrelated cases. However, though there exists an analytical solution for the infinite width uncorrelated case, it should be noted that no such solution for the infinite width correlated case is available. Thus this evidence has only been analytically verified for widths up to 5 for correlated finite width systems, and the pattern suggested by this data set may yet be the result of some finite width effect. Nonetheless, the FWM method in itself provides a systematic means to deal with these correlations that can be generalized from the LCS to other sequence comparison problems.

FIG. 8: Above, the four possible futures or transitions available to the state (0,0,0,0) are obtained diagrammatically. These transitions, reading from the upper left, are (0,0,0,0), (1,0,1,0), (0,1,0,1) and (1,1,1,1). Note that the states are organized with $h$ values first, then letters, both written from the top to the bottom in this diagram. In order to help clarify the origin of these four sets of numbers, the quantities relevant to the new states have been starred.



APPENDIX A: OBTAINING THE TRANSFER 
AND GROWTH MATRICES 



Our transfer matrix, as discussed in the main text, describes transitions from one state into the next. It allows us to determine the probable fraction of time spent in any state, i.e., the steady state, and coupled with the growth matrix it allows us to calculate the growth rate. Obtaining the matrix elements involves finding all transition probabilities and placing them into our matrix. To begin, one simply takes a state and writes all possible transitions out of this state. When one has done this for all possible states, the transfer matrix is complete. As an example we calculate the first column of the transfer matrix in the correlated case $W = 2$.

Starting with the first column, which represents our (0, 0, 0, 0) state, we note that there exist only four possible futures. Once we choose the two remaining letters as (0, 0), (1, 0), (0, 1) or (1, 1), the differences $h$ are completely determined. Fig. 8 shows the determination of the state transitions that result from these four sets of letters. These four transitions then become the matrix elements of the first column. The probability weighting each transition is determined by the new set of letters that bring about the new state, i.e., the starred letters in Fig. 8. In the order listed above, the states they bring about are weighted by the probabilities $p^2$, $(1-p)p$, $p(1-p)$, and $(1-p)^2$.

In order to formulate a growth matrix, we pick the line defining the growth and delete the elements of the transfer matrix which do not contribute to the growth on this line. In this example we have chosen to measure our growth along the bottom line. As an example, in Fig. 8 the two top diagrams contribute to growth because the lattice value along the bottom line grows in both these cases. However, for the bottom pair, the lower lattice value remains static; thus their contributions are missing from the growth matrix shown below.
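
The construction can be sketched for the correlated width 2 case as follows. This is an illustration under stated assumptions, not the authors' implementation: states are $(h(1,t), h(0,t), x_t, y_t)$, the new letters are 0 with probability $p$, the bonds are taken as $\eta(2,t) = [x_{t+1} = y_t]$, $\eta(1,t) = [x_{t+1} = y_{t+1}]$, $\eta(0,t) = [x_t = y_{t+1}]$ (an assignment inferred from the figures), and growth is measured along the bottom line $x = 0$, where the increment is $r(0,t) = \max\{\eta(0,t), h(0,t)\}$. All function names are invented here.

```python
from fractions import Fraction
from itertools import product


def step_h(h, eta):
    """One time step of the h-recursion for width 2; eta = (eta2, eta1, eta0)."""
    h1, h0 = h
    e2, e1, e0 = eta
    r1, r0 = max(e2, h1), max(e0, h0)
    s1, s0 = max(0, e2 - h1), max(0, e0 - h0)
    return (max(0, h1 + e1 - r1, h1 + s0 - r1),
            max(0, h0 + e1 - r0, h0 + s1 - r0))


def solve(A, b):
    """Gauss-Jordan elimination over exact fractions for a nonsingular system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def transition(old, xn, yn):
    """New state and bottom-line growth flag when the letters (xn, yn) are appended."""
    h1, h0, x, y = old
    eta = (int(xn == y), int(xn == yn), int(x == yn))
    return step_h((h1, h0), eta) + (xn, yn), max(eta[2], h0) == 1


def correlated_growth_rate(p):
    p = Fraction(p)
    letter_prob = {0: p, 1: 1 - p}
    states = list(product((0, 1), repeat=4))     # binary order 0000, 0001, ...
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = [[Fraction(0)] * n for _ in range(n)]    # T[new][old]
    g = [Fraction(0)] * n                        # column sums of the growth matrix
    for old in states:
        for xn, yn in product((0, 1), repeat=2):
            prob = letter_prob[xn] * letter_prob[yn]
            new, grows = transition(old, xn, yn)
            T[idx[new]][idx[old]] += prob
            if grows:
                g[idx[old]] += prob
    A = [[T[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    A[-1] = [Fraction(1)] * n                    # normalization of the steady state
    s = solve(A, [Fraction(0)] * (n - 1) + [Fraction(1)])
    return sum(gi * si for gi, si in zip(g, s))


print(correlated_growth_rate(Fraction(1, 2)))  # 7/10
```

Under these assumptions the transitions out of (0, 0, 0, 0) are the four listed in the caption of Fig. 8, and the growth rate at $p = 1/2$ comes out as $7/10$, the width 2 entry of Table I.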

Repeating this for each possible starting state leads 
to the following matrix representations where the states 
are ordered from least to greatest in binary (0000, 0001, 
0010, 0011,...). 